Introduction

Where do you think autumn-spawning Atlantic herring spawns? In order to explore the spawning distribution of Atlantic herring, species distribution models were created for larvae of herring.

1. Collect data and build model

1.1 Collect & organize occurrence data

We collect data of Atlantic herring larvae occurrences from 2000 - 2020 from the International Herring Larvae Surveys (IHLS). The occurrence dataset is read from a parquet file stored in the data lake.

# the file to process
acf <- S3FileSystem$create(
  anonymous = T,
  scheme = "https",
  endpoint_override = "s3.waw3-1.cloudferro.com"
)

eurobis <- arrow::open_dataset(acf$path("emodnet/biology/eurobis_occurence_data/eurobisgeoparquet/eurobis_no_partition_sorted.parquet" ))
df_occs <- eurobis |> 
  filter(aphiaidaccepted==126417, datasetid==4423,
         longitude > -12, longitude < 10,
         latitude > 48, latitude < 62,
         observationdate >= as.POSIXct("2000-01-01"),
         observationdate <= as.POSIXct("2020-12-31")) |> 
  collect() 

glimpse(df_occs)
## Rows: 7,902
## Columns: 10
## $ occurrenceid            <int> 18246122, 18144150, 18003290, 18245864, 182458…
## $ datasetid               <int> 4423, 4423, 4423, 4423, 4423, 4423, 4423, 4423…
## $ observationdate         <dttm> 2016-01-18, 2013-01-14, 2009-01-18, 2015-09-2…
## $ longitude               <dbl> 4.500000, 4.500000, 4.500000, -3.833333, -3.83…
## $ latitude                <dbl> 52.41667, 52.55000, 52.58300, 58.58333, 58.750…
## $ scientificname          <chr> "Clupea harengus", "Clupea harengus", "Clupea …
## $ aphiaid                 <int> 126417, 126417, 126417, 126417, 126417, 126417…
## $ scientificname_accepted <chr> "Clupea harengus", "Clupea harengus", "Clupea …
## $ aphiaidaccepted         <int> 126417, 126417, 126417, 126417, 126417, 126417…
## $ taxonrank               <int> 220, 220, 220, 220, 220, 220, 220, 220, 220, 2…
df_occs <- df_occs %>% 
  select(Latitude=latitude,
         Longitude=longitude, 
         Time=observationdate) %>%
  mutate(year = year(Time),
         month = month(Time))

mapview(df_occs %>% dplyr::select(Longitude) %>% pull,
        df_occs %>% dplyr::select(Latitude) %>% pull,
        crs = "epsg:4326")
table(df_occs$month)
## 
##    1    9   10   12 
## 2293 4480   99 1030

1.2 Remove duplicates

We start with 7902 occurrences from the IHLS.

df_occs <- df_occs %>%
  distinct(year, month, Longitude, Latitude, .keep_all = TRUE)

After removal of duplicates, 5347 occurrences are left.

1.3 Reduce sampling bias

To deal with sampling bias, a spatial filtering technique is applied (Vollering et al. 2019). Here we thinned the occurrences so that each pair of occurrences has a minimum distance of 10 nautical miles or 18.52 km. This distance is the recommended distance between valid hauls in the DATRAS trawl surveys (ICES 2015).

# Thin towards distance of 10 NM or 18.52 km
# This is the distance between valid hauls in ICES trawl surveys

df_occs_thinned <- df_occs[0,] %>% select(-Time)
for (y in 1:length(2000:2020)) {
  for (m in 1:length(1:12)) {
    tmp_df <- df_occs %>% filter(year == c(2000:2020)[y],
                                 month == m)
    tmp_df_thinned <- spThin::thin(tmp_df %>% mutate(species = "Atlantic herring"),
                                   lat.col = "Latitude", long.col = "Longitude",
                                   spec.col = "species", thin.par = 18.52, reps = 1,
                                   write.files = FALSE, locs.thinned.list.return = TRUE,
                                   verbose = FALSE)[[1]]
    
    tmp_df_thinned <- tmp_df_thinned %>%
      mutate(year = c(2000:2020)[y],
             month = m)
    
    df_occs_thinned <- rbind(df_occs_thinned, tmp_df_thinned)
  }
  print(paste0(c(2000:2020)[y], " done"))
}

save(df_occs_thinned, file = "data/df_thinned.Rdata")
load("data/df_thinned.Rdata")

nrow(df_occs_thinned)
## [1] 3995
# 3995

mapview(df_occs_thinned %>% dplyr::select(Longitude) %>% pull,
        df_occs_thinned %>% dplyr::select(Latitude) %>% pull,
        crs = "epsg:4326")

After dealing with spatial bias, 3995 occurrences remain for modelling.

1.4 Create background points

We will use Maximum entropy (Maxent) models to fit the habitat suitability of herring larvae (S. J. Phillips, Dudı́k, and Schapire 2004). This modelling approach makes use of background points instead of (pseudo-)absences to characterize the study area (S. J. Phillips et al. 2009). Here, we derive background points using a buffer of 100 km around the occurrences (Barve et al. 2011).

# lets take a spatial buffer of 100 km around the occurrence points for this example

abs <- spatiotemp_pseudoabs(spatial.method = "buffer", temporal.method = "random",
                            occ.data = df_occs_thinned %>% mutate(x = Longitude, y = Latitude),
                            temporal.ext = c("2000-01-01", "2020-12-31"), spatial.buffer = 100000,
                            n.pseudoabs = 10000)

glimpse(abs)

#limit temporal values of background points to months where occurrences of larvae are present
set.seed(123)
abs <- abs %>%
  mutate(month = sample(unique(df_occs_thinned$month), nrow(abs), replace = TRUE)) %>%
  mutate(Longitude = x, Latitude = y) %>%
  select(-day, -x, -y)
glimpse(abs)

save(abs, file = "zarr_extraction/absences_save.Rdata")

1.5 Extract environmental values at occurrences and background points

At each occurrence and background point, we want to know what the value is of the environmental variables at this location and time.

Historically, this was proceeded by a cumbersome process of looking for and downloading relevant maps for each parameter. Then, environmental values could be extracted at point locations from these maps. Now, most of the available oceanic environmental data is stored in the datalake as .zarr files. These .zarr files can be accessed efficiently, directly at the location and time needed. This process greatly improves the speed of the process.

The spatial resolution of environmental variables should match the resolution of the species records during species distribution modelling (Sillero and Barbosa 2021). For this reason, the extraction method from .zarr files allows for the inclusion of a buffer around the requested coordinates. As such, environmental values are not only extracted at the requested longitude, latitude and time coordinate, but from all cells within the specified buffer around the location. Next, a user specified function (i.e. mean, max, min, median) is applied to these values to summarize this information into one value for each requested longitude, latitude and time coordinate. This way, an uncertainty is added to the location of the occurrence / (pseudo-)absence / background point. In this example, we apply a buffer of 10 NM, which is the resolution of the occurrence data.

load("zarr_extraction/SAVE/absences_save.Rdata")

#combine occurrences and background points
df_occ_bg <- rbind(df_occs_thinned %>% mutate(presence = 1), 
                   abs %>% mutate(presence = 0))
glimpse(df_occ_bg)


source("zarr_extraction/editoTools.R")
options("outputdebug"=c('L','M'))
load(file = "zarr_extraction/editostacv2.par")

#the requested timestep resolution of the dataset in milliseconds
#in this case we work with monthly data (1 month = 30.436875*24*3600*1000 = 2629746000 milliseconds)
timeSteps=c(2629746000)

##TODO: ADD ELEVATION, WINDFARMS AND SEABED SUBSTRATE ENERGY ------
parameters = list("thetao" = c("par" = "thetao", "fun" = "mean", "buffer" = "18520"),
                  "so" = c("par" = "so", "fun" = "mean", "buffer" = "18520"),
                  "zooc" = c("par" = "zooc", "fun" = "mean", "buffer" = "18520"),
                  "phyc" = c("par" = "phyc", "fun" = "mean", "buffer" = "18520"))

#check if they are all available in the data lake
for (parameter in parameters) {
  param = ifelse(length(parameter) == 1, parameter, parameter["par"])
  if(! param %in% unique(EDITOSTAC$par)) dbl("Unknown parameter ", param)
}

#extract function (requires POSIXct Time column)
df_occ_bg_env = enhanceDF(inputPoints = df_occ_bg %>% 
                            mutate(Time = as.POSIXct(paste(year,month,1,sep = "-"))),
                          requestedParameters = parameters,
                          requestedTimeSteps = timeSteps,
                          stacCatalogue = EDITOSTAC,
                          verbose="on")

glimpse(df_occ_bg_env)

df_occ_bg_env <- df_occ_bg_env %>% select(Longitude, Latitude, year, month,
                                          thetao, so, zooc, phyc)

# remove observations where no environmental values were present
df_occ_bg_env <- drop_na(df_occ_bg_env)
table(df_occ_bg_env$presence)
#    0    1 
# 7617 3910 


## TODO remove this part ----
substr_lvl <- tibble(sub_char = c("Fine mud", "Sand", "Muddy sand", "Mixed sediment",
                                  "Coarse substrate","Sandy mud or Muddy sand", "Seabed",
                                  "Rock or other hard substrata","Sandy mud", "Sandy mud or Muddy sand ",
                                  "Sediment","Fine mud or Sandy mud or Muddy sand"),
                     seabed_substrate = c(1:12))
energy_lvl <- tibble(ene_char = c("High energy", "Moderate energy", "Low energy", "No energy information"),
                     seabed_energy = c(1:4))

df_occ_bg_env <- df_occ_bg_env %>%
  left_join(energy_lvl, by = "seabed_energy") %>%
  left_join(substr_lvl, by = "seabed_substrate") %>%
  dplyr::select(-seabed_energy, -seabed_substrate) %>%
  mutate(windfarms = as.factor(windfarms))

save(df_occ_bg_env, file = "zarr_extraction/SAVE/pres_abs_env_save.Rdata")

1.6 Model creation using ENMeval

Maxent can be tailored by employing combinations of feature classes and regularization multipliers (S. B. Phillips et al. 2006). Fifteen combinations were tested using the corrected Akaike’s Information Criterion (AICc) as a selection criterion (Zeng, Low, and Yeo 2016).

load(file = "zarr_extraction/SAVE/pres_abs_env_save.Rdata")

glimpse(df_occ_bg_env)

# input for ENMevaluate requires lon & lat as first covariates

#test different model settings using ENMeval
model_fit <- ENMeval::ENMevaluate(occs = df_occ_bg_env %>%                                    
                                    filter(presence == 1) %>% 
                                    dplyr::select(Longitude, Latitude, depth, Phyto, SSS, SST, windfarms, ZooPl, ene_char, sub_char),
                                  bg = df_occ_bg_env %>% 
                                    filter(presence == 0) %>% 
                                    dplyr::select(Longitude, Latitude, depth, Phyto, SSS, SST, windfarms, ZooPl, ene_char, sub_char),
                                  tune.args = list(fc = c("L","LQ","LQH"),
                                                   rm = c(1,2,4,8,32)),
                                  algorithm = "maxnet",
                                  partitions = "randomkfold",
                                  categoricals = c("sub_char", "ene_char", "windfarms"),
                                  doClamp = TRUE,
                                  parallel = TRUE)


save(model_fit, file = "SAVE/model_fit_save.Rdata")

Show results from ENMevaluate

The default output shows the Area Under the Curve of the Receiver Operating Characteristic plot (AUC) metric as an evaluation metric. This metric was derived using a 5-fold cross validation.

load("zarr_extraction/SAVE/model_fit_save.Rdata")

print(model_fit %>% eval.results() %>% filter(delta.AICc == 0))
##    fc rm   tune.args auc.train cbi.train auc.diff.avg auc.diff.sd auc.val.avg
## 1 LQH  1 fc.LQH_rm.1 0.8090342        NA  0.003626047    0.003206   0.8084254
##    auc.val.sd cbi.val.avg cbi.val.sd or.10p.avg  or.10p.sd   or.mtp.avg
## 1 0.003942938          NA         NA  0.1025575 0.01737908 0.0002557545
##      or.mtp.sd     AICc delta.AICc w.AIC ncoef
## 1 0.0005718844 70485.62          0     1    32

The outcomes show that the lowest AICc score is acchieved using a regularization multiplier of 1 and the feature combinations of linear, quadratic and hinge. Using 5-fold cross-validation, the model has an AUC score of 0.81, which implies a good fit according to Krzanowski and Hand (2009).

2. Analyse model

2.1 Response curves

Visualize the partial response curves of the occurrence of larvae of herring per environemntal variable, using a bootstrapping method derived from Thuiller et al. (2009).

load(file = "zarr_extraction/SAVE/pres_abs_env_save.Rdata")

eval_res <- model_fit
opt.aicc <- eval.results(model_fit) %>% filter(delta.AICc == 0)
mod <- eval_res@models[[which(names(eval_res@models) == opt.aicc$tune.args)]]
  
vs <- c("depth", "SST", "SSS", "Phyto", "ZooPl", "windfarms", "sub_char", "ene_char")

name_key <- data.frame(old = c("depth", "SST", "SSS", "Phyto", "ZooPl", "windfarms", "sub_char", "ene_char"),
                       new = c("Depth (m)%", "Sea surface temperature (°C)%", "Sea surface salinity (PSU)%", 
                               "Phytoplankton concentration%(mmol C / m³)", "Zooplankton concentration%(g C / m²)", 
                               "Windfarm presence", "Seabed substrate", "Seabed energy"))
addline_format <- function(x,...){
  gsub('%','\n',x)
}

for (i in 1:nrow(name_key)) {

  v <- name_key$old[i]
  out_name <- name_key$new[i]
  
  dat <- response.plot(mod, v, type = "cloglog", 
                ylab = "Probability of occurrence",
                plot = F)
  
  if(is.character(dat[1,1])) {
    assign(v, ggplot(dat) +
      geom_bar(aes_string(x = v, y = "pred"), stat='identity', fill = "#332288") +
      # scale_x_discrete(limits = rev(substr_key[which(substr_key %in% dat_lv$sub_char)])) +
      scale_y_continuous(limits = c(0,1)) +
      coord_flip() +
      labs(title = out_name, x = "", y = "Probability of presence") +
      theme_bw() +
      theme(legend.title = element_blank(),
            plot.title = element_text(size=10, face = "bold", colour = "black", hjust = 0.5),
            axis.text.y = element_text(size=10, face = "plain", colour = "black"),
            axis.text.x = element_text(size=10, face = "plain", colour = "black"),
            axis.title.x = element_text(size=10, face = "bold", colour = "black"),
            axis.title.y = element_text(size=10, face = "bold", colour = "black"),
            legend.text = element_text(size=10, face = "bold", colour = "black")))
  } else {
    
    #define plot bounds (restricted to where occurrences are present)
    min <- df_occ_bg_env %>% filter(presence == 1) %>% dplyr::select(all_of(v)) %>% min
    max <- df_occ_bg_env %>% filter(presence == 1) %>% dplyr::select(all_of(v)) %>% max
    
    assign(v, ggplot(dat) +
      geom_line(aes_string(x = v, y = "pred"), linewidth = 1, color = "#332288") + 
      labs(x = addline_format(out_name), y = "Probability of presence") +
      scale_x_continuous(limits = c(min, max)) +
      scale_y_continuous(limits = c(0,1)) +
      theme_bw() +
      theme(legend.title = element_text(size=10, face = "bold", colour = "black"),
            legend.text = element_text(size=10, face = "plain", colour = "black"), 
            axis.text.y = element_text(size=10, face = "plain", colour = "black"),
            axis.text.x = element_text(size=10, face = "plain", colour = "black"),
            axis.title.x = element_text(size=10, face = "bold", colour = "black"),
            axis.title.y = element_text(size=10, face = "bold", colour = "black")))
  }
}

grid.arrange(depth, SST, SSS, Phyto, ZooPl, windfarms, ene_char, sub_char)

3. Project model

3.1 Get raster slices to project on

Make spatial predictions for the entire study area of the Northeast Atlantic. This is done by combining the created model with maps of the environemntal variables. We will make a projection of the model for January 2020 here.

Maps are derived from the .zarr files in the datalake at a given time and between a set of longitude and latitudes (defined by the study area).

variables <- c("depth", "SST", "SSS", "Phyto",
               "ZooPl", "windfarms", "seabed_substrate", 
               "seabed_energy")

files <- tibble(name = list.files("data/raster_slices/", pattern = ".tif|.grd", full.names = TRUE),
                parameter = str_extract(name, pattern = paste0(variables, collapse = "|")),
                month = str_extract(name, pattern = "\\d{2}|\\d"))

#get model
eval_res <- model_fit
opt.aicc <- eval.results(model_fit) %>% filter(delta.AICc == 0)
mod <- eval_res@models[[which(names(eval_res@models) == opt.aicc$tune.args)]]

3.2 Spatial predictions

Spatial predictions are restricted to the months were occurrence data of herring larvae is available (September, October, December and January). This is the same period where autumn-spawning herring spawn.

##TODO: delete this part
substr_lvl <- tibble(sub_char = c("Fine mud", "Sand", "Muddy sand", "Mixed sediment",
                                  "Coarse substrate","Sandy mud or Muddy sand", "Seabed",
                                  "Rock or other hard substrata","Sandy mud", "Sandy mud or Muddy sand ",
                                  "Sediment","Fine mud or Sandy mud or Muddy sand"),
                     seabed_substrate = c(1:12))
energy_lvl <- tibble(ene_char = c("High energy", "Moderate energy", "Low energy", "No energy information"),
                     seabed_energy = c(1:4))


predictions <- stack()
for (m in unique(df_occ_bg_env$month)) {
  
  plot_st <- stack(files %>% 
                     filter(month == m | is.na(month)) %>% 
                     dplyr::select(name) %>% 
                     pull())
  #get model
  
  names(plot_st) <- str_extract(names(plot_st), pattern = paste0(variables, collapse = "|"))
  names(plot_st)[which(names(plot_st) == "seabed_energy")] <- "ene_char"
  names(plot_st)[which(names(plot_st) == "seabed_substrate")] <- "sub_char"
  
  pred_m <- predict(plot_st, mod, clamp=T, type="cloglog",
                         factors = list(ene_char = factor(energy_lvl$seabed_energy,
                                                          labels = energy_lvl$ene_char,
                                                          levels = energy_lvl$seabed_energy),
                                        sub_char = factor(substr_lvl$seabed_substrate,
                                                          labels = substr_lvl$sub_char,
                                                          levels = substr_lvl$seabed_substrate)))
  # crop to extent of study area
  pred_m <- crop(pred_m, extent(-12, 10, 48, 62))
  
  names(pred_m) <- paste0("prediction_", month.name[m])

  predictions <- stack(predictions, pred_m)
}

plot(predictions)

4. References

Barve, Narayani, Vijay Barve, Alberto Jiménez-Valverde, Andrés Lira-Noriega, Sean P Maher, A Townsend Peterson, Jorge Soberón, and Fabricio Villalobos. 2011. “The Crucial Role of the Accessible Area in Ecological Niche Modeling and Species Distribution Modeling.” Ecological Modelling 222 (11): 1810–19.
ICES. 2015. “Manual for the International Bottom Trawl Surveys. Revision IX.”
Krzanowski, Wojtek J, and David J Hand. 2009. ROC Curves for Continuous Data. Chapman; Hall/CRC.
Phillips, Sharon B, Viney P Aneja, Daiwen Kang, and S Pal Arya. 2006. “Modelling and Analysis of the Atmospheric Nitrogen Deposition in North Carolina.” International Journal of Global Environmental Issues 6 (2-3): 231–52.
Phillips, Steven J, Miroslav Dudı́k, Jane Elith, Catherine H Graham, Anthony Lehmann, John Leathwick, and Simon Ferrier. 2009. “Sample Selection Bias and Presence-Only Distribution Models: Implications for Background and Pseudo-Absence Data.” Ecological Applications 19 (1): 181–97.
Phillips, Steven J, Miroslav Dudı́k, and Robert E Schapire. 2004. “A Maximum Entropy Approach to Species Distribution Modeling.” In Proceedings of the Twenty-First International Conference on Machine Learning, 83.
Sillero, Neftalı́, and A Márcia Barbosa. 2021. “Common Mistakes in Ecological Niche Models.” International Journal of Geographical Information Science 35 (2): 213–26.
Thuiller, Wilfried, Bruno Lafourcade, Robin Engler, and Miguel B Araújo. 2009. “BIOMOD–a Platform for Ensemble Forecasting of Species Distributions.” Ecography 32 (3): 369–73.
Vollering, Julien, Rune Halvorsen, Inger Auestad, and Knut Rydgren. 2019. “Bunching up the Background Betters Bias in Species Distribution Models.” Ecography 42 (10): 1717–27.
Zeng, Yiwen, Bi Wei Low, and Darren CJ Yeo. 2016. “Novel Methods to Select Environmental Variables in MaxEnt: A Case Study Using Invasive Crayfish.” Ecological Modelling 341: 5–13.